Ryegrass, a potential source of gluten-like proteins

Sophia Escobar-Correas

CSIRO Agriculture & Food

Introduction

Hello! My name is Sophia. I am a molecular biologist working in proteomics, currently a postdoctoral fellow. Before Data School, my only coding was with macros in Excel. I used to spend a lot of time cleaning and tidying protein data, and I always felt I could do it faster if I had programming skills. These weeks learning R have changed my daily work, and seeing what is possible has given me a new perspective on my research.

My Project

I am studying ryegrass, a potential source of gluten peptide contamination. Gluten refers to a class of storage proteins found in cereal grains, including wheat, rye, barley, and oats. In people with coeliac disease, consumption of these gluten proteins triggers an autoimmune response. Previous studies have identified gluten-like proteins in ryegrass. Since ryegrass is a common weed in grain fields, there is a possibility of cross-contamination. First, I need to characterize the gluten proteins and peptides of ryegrass origin. To do so, I performed data-dependent mass spectrometric analysis, which identified 3162 proteins and 8231 peptides. What I need to do now is find out how many of them are gluten-like.

Preliminary results

I will analyse the amino acid composition of the proteins in my database. Since gluten proteins are rich in the amino acids glutamine (Q) and proline (P), I will search for all proteins that contain more than 20% glutamine.

Tables
Table 1: Protein Database
N Accession Name Sequence
1 spP4910614331_MAIZE 14-3-3-like MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEEGRGNEDRVTLIKDYRGKIETELTKICDGILKLLETHLVPSSTAPESKVFYLKMKGDYYRYLAEFKTGAERKDAAENTMVAYKAAQDIALAELAPTHPIRLGLALNFSVFYYEILNSPDRACSLAKQAFDEAISELDTLSEESYKDSTLIMQLLRDNLTLWTSDISEDPAEEIREAPKRDSSEGQ
2 spQ84Q72HS181_ORYSJ 18.1 MSLIRRSNVFDPFSLDLWDPFDGFPFGSGSRSSGSIFPSFPRGTSSETAAFAGARIDWKETPEAHVFKADVPGLKKEEVKVEVEDGNVLQISGERSKEQEEKTDKWHRVERSSGKFLRRFRLPENTKPEQIKASMENGVLTVTVPKEEPKKPDVKSIQVTG
3 spP69555PSBH_WHEAT Photosystem MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN
4 spP36886PSAK_HORVU Photosystem MASQLSAMTSVPQFHGLRTYSSPRSMATLPSLRRRRSQGIRCDYIGSSTNLIMVTTTTLMLFAGRFGLAPSANRKATAGLKLEARESGLQTGDPAGFTLADTLACGAVGHIMGVGIVLGLKNTGVLDQIIG
5 spQ6YZE2GSA_ORYSJ Glutamate-1-semialdehyde MAGAAAASAAAAAVASGISARPVAPRPSPSRARAPRSVVRAAISVEKGEKAYTVEKSEEIFNAAKELMPGGVNSPVRAFKSVGGQPIVFDSVKGSRMWDVDGNEYIDYVGSWGPAIIGHADDTVNAALIETLKKGTSFGAPCVLENVLAEMVISAVPSIEMVRFVNSGTEACMGALRLVRAFTGREKILKFEGCYHGHADSFLVKAGSGVATLGLPDSPGVPKGATSETLTAPYNDVEAVKKLFEENKGQIAAVFLEPVVGNAGFIPPQPGFLNALRDLTKQDGALLVFDEVMTGFRLAYGGAQEYFGITPDVSTLGKIIGGGLPVGAYGGRKDIMEMVAPAGPMYQAGTLSGNPLAMTAGIHTLKRLMEPGTYDYLDKITGDLVRGVLDAGAKTGHEMCGGHIRGMFGFFFTAGPVHNFGDAKKSDTAKFGRFYRGMLEEGVYLAPSQFEAGFTSLAHTSQDIEKTVEAAAKVLRRI
Note: Five example proteins from the database. The Sequence column gives the amino acids (one-letter code) that make up each protein.

The next step is to count the amino acids Q and P in each protein sequence.

Table 2: Amino acid composition
N Accession Name totalAA Qcomp Q100 Pcomp P100
1 spP4910614331_MAIZE 14-3-3-like 261 6 2.30 7 2.68
2 spQ84Q72HS181_ORYSJ 18.1 161 4 2.48 12 7.45
3 spP69555PSBH_WHEAT Photosystem 73 1 1.37 5 6.85
4 spP36886PSAK_HORVU Photosystem 131 5 3.82 5 3.82
5 spQ6YZE2GSA_ORYSJ Glutamate-1-semialdehyde 478 8 1.67 27 5.65
Note: totalAA = Total number of amino acids in the protein
Qcomp = Number of glutamine residues in the protein
Q100 = Percentage of glutamine in the protein
Pcomp = Number of proline residues in the protein
P100 = Percentage of proline in the protein
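
The columns of Table 2 can be derived directly from the Sequence column of Table 1. The following is a minimal sketch in base R, not necessarily the code used in the project. The function names `count_residue` and `composition_table` are illustrative assumptions, and the second sequence (`hypothetical_glutenlike`) is made up to show the 20% glutamine filter. The first sequence is protein 3 from Table 1.

```r
# Count how often one amino acid letter occurs in each sequence.
count_residue <- function(sequences, residue) {
  vapply(
    strsplit(sequences, ""),
    function(chars) sum(chars == residue),
    integer(1)
  )
}

# Build the Table 2 columns (totalAA, Qcomp, Q100, Pcomp, P100).
composition_table <- function(accession, sequences) {
  total <- nchar(sequences)
  q <- count_residue(sequences, "Q")
  p <- count_residue(sequences, "P")
  data.frame(
    Accession = accession,
    totalAA = total,
    Qcomp = q,
    Q100 = round(100 * q / total, 2),
    Pcomp = p,
    P100 = round(100 * p / total, 2),
    stringsAsFactors = FALSE
  )
}

example <- composition_table(
  accession = c("spP69555PSBH_WHEAT", "hypothetical_glutenlike"),
  sequences = c(
    "MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN",
    "QQPQQPFPQQ"  # hypothetical sequence, not from the database
  )
)

# Keep only proteins with more than 20% glutamine.
glutamine_rich <- example[example$Q100 > 20, ]
print(example)
print(glutamine_rich)
```

For the first sequence this reproduces row 3 of Table 2 (73 amino acids, 1 Q, 1.37% Q, 5 P, 6.85% P), and only the hypothetical sequence passes the filter.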

Working with Protein Data

Plotting
Figure 1: Glutamine and Proline composition in Ryegrass

Figure 2: Glutamine and Proline composition in Ryegrass

Figure 3: Glutamine and Proline composition in Ryegrass


My Digital Toolbox

To work with protein databases, which are usually in FASTA format (.fasta), I used the package Biostrings. For tidying the data, I used Tidyverse (my new best friend) and plyr.
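
To show what reading a FASTA file involves, here is a minimal base R sketch that turns a file into a table with one row per protein. In practice Biostrings provides `readAAStringSet()` for this job; the helper below, `read_fasta_simple`, is a hypothetical stand-in that runs without Bioconductor, and the two records written to the temporary file are invented examples.

```r
# Read a FASTA file into a data frame with one row per record.
read_fasta_simple <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")      # header lines start with ">"
  headers <- sub("^>", "", lines[is_header])
  # Assign every sequence line to the most recent header above it.
  record_id <- factor(cumsum(is_header)[!is_header],
                      levels = seq_along(headers))
  # Join the (possibly wrapped) sequence lines of each record.
  sequences <- vapply(
    split(lines[!is_header], record_id),
    paste, character(1), collapse = ""
  )
  data.frame(
    Header = headers,
    Sequence = unname(sequences),
    stringsAsFactors = FALSE
  )
}

# Write a small invented FASTA file and read it back.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(
  c(">protein_A example record", "MATQTV", "EDSSKP",
    ">protein_B example record", "QQPQQPFPQQ"),
  fasta_path
)
proteins <- read_fasta_simple(fasta_path)
print(proteins)
```

The resulting data frame has the same shape as Table 1, so it can be passed straight to tidyverse functions for tidying.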
